Golang Job: Site Reliability Engineer

Company

OutSystems
Portugal

Location

Remote Position
(From Everywhere/No Office Location)

Job type

Full-Time

Golang Job Details

As the #1 low-code application development platform, OutSystems provides customers with everything they need to build apps incredibly fast. So let’s cut to the chase.
We are looking for a Site Reliability Engineer for our Cloud Networking team.

Resiliency doesn’t happen by accident. That’s particularly true for large-scale, massively distributed systems that run in the Cloud. It needs to be deliberately engineered into systems, and considered throughout the entire development lifecycle, from early design to operations.
“Site Reliability Engineering is what happens when you ask a software engineer to design an operations team.”

At OutSystems, our Site Reliability Engineers (SREs) combine advanced Software Engineering practices with mature Operations skills in order to deliver and operate highly resilient systems at scale. SREs ensure that our Cloud services meet the reliability and uptime requirements of our demanding enterprise customers. This is achieved proactively, through sound engineering practices and resilient design from day 0, as well as reactively, through a well-defined and effective on-call rotation that runs 24x7.

SREs engineer our production systems to run at scale, so that manual and repetitive work is fully eliminated. They follow blameless postmortem practices so that all incidents are well understood and problems are fixed at their root. Over time, they make our systems more robust, fault-tolerant and able to self-heal during the worst of outages and through the most unexpected circumstances.

SREs are experts in troubleshooting complex problems and can dig very deep into why systems break in production. To do that, they rely on observability practices like centralized logging, distributed tracing and anomaly detection. They shorten mean time to detect (MTTD) and mean time to recover (MTTR) by improving the accuracy of alarms and the speed of troubleshooting.

SREs leverage the latest infrastructure automation best practices and the toolset offered by Cloud Providers to multiply their effectiveness and achieve greater outcomes.

Key Responsibilities and Skills:
  • Automate highly scalable and resilient cloud operations that can be executed with no customer downtime;
  • Perform blameless root cause analysis on outages and ensure action items are completed;
  • Fix resiliency problems wherever they are in the product, or collaborate with product teams to do it;
  • Monitor customer infrastructure, measuring availability and system health;
  • Collaborate with customer support in recovering from escalated outages;
  • Troubleshoot complex incidents in highly distributed systems;
  • Shorten time to detection by improving the accuracy of alarms;
  • Be a key stakeholder in the design of cloud services so that they are resilient from day 0.

Minimum Qualifications and Skills:
  • Bachelor's or Master's degree in Computer Science or similar;
  • 5+ years of experience in software development or operations;
  • Programming skills in a high-level language (Python, Golang, etc.);
  • Experience with automation and IaC (Terraform, CloudFormation, Ansible, etc.);
  • Experience in troubleshooting and debugging;
  • Availability to work in shifts and be part of the 24x7 on-call rotation;
  • Fluency in English and good communication skills.

Preferred Qualifications and Skills:
  • Experience with Cloud providers (AWS, Azure and GCP);
  • Experience with Docker and Kubernetes;
  • Experience with Ingress Controllers;
  • Experience with monitoring and troubleshooting complex distributed systems;
  • Experience in designing resilient and fault-tolerant systems;
  • Experience in debugging complex, distributed systems;
  • Understanding of OAuth 2.0 and OIDC;
  • Experience with AWS CloudFront and/or other CDNs would be ideal.

Location: Portugal, remote

What do we have to offer you?

Working at OutSystems gives you the opportunity to change the world of software development! We are leading the evolution of the software development process by removing complexity from it, allowing developers to focus on delivering value and making a difference. But removing complexity from software development means that we need to solve it ourselves, so we will have lots of opportunities and complex challenges to offer you.

We don’t have many rules, but we have a lot of common sense. Our commitment to our culture is highlighted in The Small Book of the Few Big Rules. This commitment to culture has landed us in the top six Forbes Best Cloud Computing Companies and CEOs To Work For three years running. We foster a culture where people make a difference. And every day, we focus on making sure that we keep a startup and innovative mindset.

You will work with colleagues who are as smart, hardworking and driven as you, spread all over the globe, in a solid company that keeps growing, changing and innovating, and that gives teams room to be proactive and creative.

Are you ready for the next step in your career? Then we’d love to hear from you!

